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Abstract. A continuous time model for multiagent systems governed by reinforcement learning with scale- 
free memory is developed. The agents are assumed to act independently of one another in optimizing their 
choice of possible actions via trial-and-error search. To gain awareness about the action value the agents 
accumulate in their memory the rewards obtained from taking a specific action at each moment of time. 
The contribution of the rewards in the past to the agent current perception of action value is described 
by an integral operator with a power-law kernel. Finally a fractional differential equation governing the 
O l' system dynamics is obtained. The agents are considered to interact with one another implicitly via the 

reward of one agent depending on the choice of the other agents. The pairwise interaction model is adopted 
to describe this effect. As a specific example of systems with non-transitive interactions, a two agent and 
three agent systems of the rock-paper-scissors type are analyzed in detail, including the stability analysis 
and numerical simulation. Scale-free memory is demonstrated to cause complex dynamics of the systems 
at hand. In particular, it is shown that there can be simultaneously two modes of the system instability 
undergoing subcritical and supercritical bifurcation, with the latter one exhibiting anomalous oscillations 
with the amplitude and period growing with time. Besides, the instability onset via this supercritical mode 
may be regarded as "altruism self-organization". For the three agent system the instability dynamics is 



found to be rather irregular and can be composed of alternate fragments of oscillations different in their 
properties. 

PACS. 87.23. Ge Dynamics of social systems - 89. 75. Da Systems obeying scaling laws - 02.50.Le Decision 
> I theory and game theory - 05.65. + b Self-organized systems 
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. 1 Introduction evolutionary games [TT] . adaptive competition in a market 

[12] . as well as to establish the relationship between the 
^ '. During the last decades application of physical notions reinforcement learning and the replicator model of popu- 
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and the mathematical formalism of statistical physics to lation biology [HCE]. The latter ' in Particular, made it 

O ■ describing economic and social systems attracted much at- possible to analyze the complex behavior, including the 

tention in the scientific community (see, e.g., (J). The effi- onset of dynamical chaos for agent ensembles using the 

ciency of this approach has been demonstrated, in partic- techniques of dynamical systems [H1[TMI7] ■ 
ular, in modeling cooperative motion of vehicles in traffic It should be noted that the majority of models similar 

5-1 ' flow, pedestrian ensembles, and groups of animals with so- to ones listed above and constructed to capture character- 

9^ ', cial behavior [2] , dynamics of stock markets [3l|4] , opinion istic features of the social system dynamics invoke notions 

formation, culture and language evolution [5]. The multi- and concepts inherited directly from statistical physics, 

agent reinforcement learning problem is one of the promis- However, generally speaking, agent models imitating the 

ing techniques of modeling the evolution and adaptation of behavior of human beings should possess also their own 

complex systems in which human factor plays an essential features that are inapplicable to physical objects or, at 

role. Until now, this problem was studied mainly within least, anomalous from the standpoints of physics [H3[]]|]. 

the scope of artificial intelligence (for a review see [5])- One of such anomalous features is the impact of system 

Nevertheless, recently the concepts of statistical physics history implemented, in particular, via the effects of hu- 

were combined with the notions of reinforcement learning man memory in learning process as well as the evaluation 

to simulate the dynamics of minority games :7..v!) ID and of previous events in the decision making at the current 

moment of time. Comparing social systems and objects of 

a e-mail: ialubOf pi . gpi . ru classical physics we note that the latter ones have no mem- 

b e-mail: kanemoto@u-aizu.ac.jp ory in the following sense. If the positions and the veloci- 
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ties of all the particles forming a closed system are known 
at some moment of time chosen arbitrary, then its fur- 
ther dynamics is completely predictable at least formally. 
Randomness arises at the level of reduced description and 
even in this case the dominating approximation is Marko- 
vian random processes again without memory; given the 
state of a system at some moment of time, its probabilistic 
dynamics is completely determined. Broadly speaking, in 
the present paper we intend to focus on possible history 
effects in social systems whose dynamics can be imitated 
by multiagent reinforcement learning as well as to discuss 
mathematical notions relevant to their description. 

General arguments about human perception and event 
evaluation convince us to make use of a scale-free-memory 
model to describe the impact of system history on the re- 
inforcement learning. In this model the impact of events 
happened in the past (at time t') is weighted at the current 
moment of time t with some function K(t — t') showing 
a power-law decrease with the time difference t — t' . In 
fact, let us consider two events influencing our decision 
in a similar manner, which enables us to compare them 
with each other in assessing the current situation. If one 
of the two events happened one day before the current date 
whereas the other happened one week ago, then we will 
regard them as substantially different in time with respect 
to their contribution to our perception of the present sit- 
uation. By contrast, if the first event occurred one month 
and one day ago and the second event occurred one month 
and one week ago, we will draw no real distinction be- 
tween each other by the time of their occurrence. In other 
words, if the time lag between the two events is compara- 
ble with the time scale separating them from the present 
moment, then their impacts will be regarded as different 
in magnitude. On the contrary, if their time lag is much 
less then the passed time these events can be considered 
to be simultaneous. Exactly such a behavior is common 
to a power-law dependence K(t — t') cx (t — t')~ z with a 
certain exponent z > 0. 

This model is partly justified by the observed long- 
term memory effects in the scale-free foraging by primates 
[20,21,22 or insects [25J[2I] and the conclusion about the 
explicit relationship between scale-free foraging and the 
memory properties |25) . The human memory retrieval is 
also characterized by a scale-free pattern [2j5] in addition 
to the fact that it is composed of many distinct systems 
(for a review see [27]). Besides, stock markets, where hu- 
man factor is doubtless of great concern, exhibit similar 
effects, in particular, time correlations in the volatility of 
returns are characterized by a power-law decay (see, e.g, 

1231231)- 

The specific purpose of the present paper is to analyze 
how the scale- free memory can be introduced into the mul- 
tiagent reinforcement learning as well as to consider its 
characteristic effects in dynamics of multiagent systems 
with agent interaction similar to the rock-paper-scissors 
game. In such a system agents share a common environ- 
ment, so each time, when one of them takes some action, 
it disturbs the environment, causing the response of the 
other agents. The learning process enables the agents with 



long-term memory to follow the variations of the envi- 
ronment in optimizing their own actions. The rock-paper- 
scissors model determining the agent interaction was cho- 
sen for two reasons. 

First, this model is one of the simplest examples of 
non-transitive interactions: a rock beats a pair of scissors, 
scissors beat a sheet of paper, and paper beats a rock. So 
ordering the three objects according to their dominance 
we get a competitive cycle, which can be responsible for 
a variety of self-organization phenomena in social and bi- 
ological systems (see, e.g., [50] and references therein). 

Second, the rock-paper-scissors dynamics is found to 
be rather common for polymorphic populations, in partic- 
ular, Jamaican cryptic coral reef communities |31j . yeast 
populations |32| and population of marine isopods |33l 
154] , microbial laboratory communities [551155] , polymor- 
phic groups of side-blotched lizards [571155] , lek-breeding 
ruffs [59TT4T)] , and Gouldian finches [H] . This fact merits a 
more detailed discussion. 



Rock-paper-scissors game and speciation 

The trimorphism of such populations, wherein the species 
interaction is referred to as a biological rock-paper-scissors 
game [32] , is caused by basic properties of the frequency 
dependent selection being crucial in maintaining the bio- 
diversity (for a review see [43]). The frequency dependent 
selection (FGS), i.e., the rate of offspring generation de- 
pending on the relative volume of a given phenotype, is 
implemented in such communities via various mechanisms. 
These mechanisms are classified into two groups, the pos- 
itive and negative FDS, according to whether they cause 
this dependence to be increasing or decreasing, respec- 
tively. When mixtures of positive and negative FDS inter- 
act, a system can become destabilized and oscillate [43] , 

Keeping in mind the further constructions, we note 
that learning and innate behavior are ones of the main 
mechanisms responsible for the positive FDS, with both 
of them affecting each other [35J. For example, the ob- 
served rapid evolution of dispersal, i.e., movement of indi- 
viduals from their natal or previous breeding sites to new 
ones [33], requires the presence of heritable genetic vari- 
ations for traits affecting dispersal behavior and strong 
selection acting on these traits (see, e.g., a review [35J). In 
particular, in vertebrates, where dispersal is often con- 
sidered to be a plastic, condition-dependent trait with 
low heritability, a significant heritability in the disper- 
sal propensity, i.e. the probability of dispersal between 
habitat patches has been directly found for collared fly- 
catchers [31]. Learning to be efficient requires enhanced 
long-term memory, especially it concerns those species of 
birds and mammals that cache food for later use or birds 
that carry out regular seasonal migrations returning to the 
same breeding and wintering grounds, or even stopover 
sites. For example, Clark's nutcrackers possess memory 
for cache sites spanning more than 7-9 months [37J and 
for long-distance migratory garden warblers their mem- 
ory of particular feeding sites persists during at least 12 
months [3FJ • It is worthwhile to note also investigations of 
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the relationship between spatial use strategies and medial 
and dorsal cortical volumes in males of the side-blotched 
lizard as a criterion of the species memory capacity |49j . 
Males in this species occur in three morphs using differ- 
ent spatial niches: large territory holders, small territory 
holders and nonterritory holders with home ranges of the 
smallest size. According to the found results, the larger 
the used territory, the larger the medial and dorsal corti- 
cal volumes. 

The analysis of the geographic variations and evolu- 
tionary history of the side-blotched lizard with the rock- 
paper-scissors mating strategies [50] has thrown light on 
basic mechanisms governing speciation. Polymorphic forms 
within a population can be the starting material for new 
species [ST, 52 ( 53 , 54 and, in this case, the rock-paper- 
scissors FDS causing self-sustaining cycle variations main- 
tains the required polymorphism |50) . On one hand, the 
found pattern has turned out to contain also monomor- 
phic and dimorphic populations that resemble the morphs 
in trimorphic populations. This resemblance suggests that 
rock-paper-scissors cycles of morphs were destabilized, giv- 
ing rise to the irreversible loss and fixation of certain 
morphs and, thus, the rapid phenotype evolution. On the 
other hand, the phylogenetic analysis revealed that this 
trimorphism was maintained for millions of years, even 
across speciation events. Thereby, the rock-paper-scissors 
interaction of species in trimorphic populations seems to 
cause evolutionary stable cyclic variations in morphs that 
undergo repeated breakdown events [50]. In addition, an- 
alyzing a clade of dung beetles [55] it has been demon- 
strated that insects assumed to be dimorphic can form 
trimorphic populations with cyclic phenotype variations 
via two threshold mechanism. 

The model to be developed will demonstrate that long- 
term scale-free memory in the multiagent reinforcement 
learning governed by the rock-paper-scissors interaction 
can be responsible for complex dynamics. In particular, 
relatively regular oscillations can be alternated by oscilla- 
tions with a large amplitude and period growing in time 
and the system admits the instability onset via subcritical 
as well as supercritical bifurcation for different modes of 
perturbations. 



2 Agent memory and reinforcement learning 

2.1 Continuous time description of multi-agent 
dynamics 

Let us consider of a collection of N agents 21 = {e^} 
(i = 1,2, ...,N) that individually can take one of the 
actions from a set X = {x} of M-elements and act inde- 
pendently of one another. The preference of an action x 
for a given agent a is determined by the agent perception 
of its value Q a (x, t) gained by the current moment of time 
t in exploring all the actions X previously. 

Within the discrete-time approximation the agents are 
assumed to take new actions at the time moments t n — 
nA, where n — 0, 1, 2, . . . and A is the time step. The 



probability of choosing an action a; by a given agent a at 
time t is 



-l 



.x'ex 



(1) 



where the quantity 1 / (3 characterizes the perception thresh- 
old of the agents in evaluating their actions. If at the initial 
time to = (i.e. for n = 0) the agents have no information 
about the action value, then the condition 



la{x,t)\ t=t =0 



(2a) 



will hold for every agent a and action x. Otherwise, the 
initial condition 



la(x,t)\ t= 



■■to 



r a (x) 



(2b) 



describes the agent preliminary opinion about the action 
value. In numerical simulations to be described below con- 
dition (|2b|) was used with the quantities Q* a {x, t) set equal 
to some random numbers to disturb the system equilib- 
rium and induce transient processes. 

The system dynamics is governed by the learning of 
agents aimed at finding the most appropriate action via 
the trial-and-error search. Following 15, 1CTT7] we make 
use of a simple integrator algorithm of the reinforcement 
learning (see, e.g., [551157] ). First, it assumes the agents 
to accumulate local rewards received at one step to raise 
awareness about the value of the possible actions. Sec- 
ond, because of bounded capacity of the agent memory 
events in the past separated from the present by time 
scales exceeding a certain value T practically do not con- 
tribute to the awareness gained by the agents at the cur- 
rent time t. Third, according to expression ([I]) each agent 
explores more often actions in the vicinity of the action 
being optimal from its current point of view. So to recon- 
struct the value of the possible actions properly it should 
weight local rewards differently depending on the prox- 
imity of a given action to the optimal one. This reason- 
ing is also supported by the fact that people overweight 
low-probability events and underweight high-probability 
events in the description-based decision-making which sum- 
maries descriptions detailing all possible outcomes and 
their respective likelihoods of each option (for a review 
see, e.g., [581159] ^1 . To allow for this feature we introduce 
the coefficient 



W a {x,t) = W[P a {x,t)] 



(3) 



weighting the reward gained by the given agent a from 
taking the action x at time t depending on its current 
perception of the action value. Following the spirit of the 
prospect theory [60] and the update rule of frequency max- 
imum Q-value heuristics [B] the weight coefficient ([3]) is 
assumed to be a non-increasing function on the current 
probability P a (x,t) of the agent a taking the action x. 

The ansatz W(P) = 1 is widely met in literature. 
However, when the VF(P)-dependence is rather weak and 
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the agent memory is long enough the reinforcement learn- 
ing mechanism under consideration can give rise to the 
Nash equilibrium instability not due to the agent inter- 
action (Appendix [X]). It is just caused by a higher rate 
of the reward accumulation for actions with higher prob- 
ability. In the present paper we focus on the multiagent 
reinforcement learning whose complex dynamics is caused 
by the mutual interaction of agents possessing, in partic- 
ular, long-term memory. So, from this standpoint such an 
instability may be treated as a "malfunction" of the ana- 
lyzed algorithm and its possible influence on the studied 
phenomena should be eliminated. Keeping in mind results 
obtained in Appendix [A] we adopt the ansatz 



W a (x,t) 



Pa(x,t) 



(4) 



which matches the simplest model being free from insta- 
bilities of the Nash equilibrium not caused by the agent 
interaction and describes the accumulation of knowledge 
about the action value proceeding uniformly, on the aver- 
age, for all the possible actions. 

Taking into account the aforementioned features the 
following version of the difference-learning equation 



A 



P a (x, £ n ) 



Ra{x\X a ) 



A 
f 



Qa(x,t n ) + Q a (x,t n ) (5) 



is applied to update the agent awareness at the time mo- 
ments {t n }. The first term in expression ([3]) on the right- 
hand side describes the accumulation of the knowledge 
about the action x a taken by the agent a at time t n . Here 
5 XXa is the Kronecker delta, the payoff function R a (x\X a ) 
describes the reward normalized to unit time that the 
agent a = at gains from the action x provided all the 
other agents 



51, 



{ai, a 2 , . . . , flj-i, U, Oj+i, . . . , ajv} 



have taken the actions 



X„ 



{xi,x 2l . . ■ , Xi-i,U,x i+ i, . . . , x N } . 



and the cofactor 1/P a (x, i) weights the contribution of the 
action x. The second term is caused by the agent memory 
loss and in what follows the inequality 



Z\<T 



(6) 



will be assumed to hold beforehand. 

By virtue of inequality (O for the function Q a {x,t) to 
reach some saturation a large number of the system up- 
dates need to be executed. In other words, every agent 
gains awareness of the action value through many trials 
and, thus, explores, explicitly or implicitly, many options 
of the agent actions. It enables us, first, to average equa- 
tion ([5]) over all the possible points of the configuration 
space 

{X} = {xi,x 2 , ■ ■ ■ ,x N } , x. t e X 



assuming the probability of a particular configuration X 
occurring at time t to be determined by the expression 



N 



V(X,t)=Y[P l (x t ,t) 



(J) 



,;=i 



and, second, to treat the action value Q a {x, t) as a contin- 
uous function of the time t. Decomposition ([7]) is due to 
the adopted assumption about the mutual independence 
of the agents in taking actions. In this way ( Appendix |A"| 
the update rule ((SJ) is reduced to the following differential 
equation 



where 



d(la( ft ~ =r a (x,t)- ^=q a (x,t), 



q a (x,t) = Q a (x,t) - jj QoOs''*) 



(8) 



(9) 



x'ex 



and 



r a {x, 

^EEW II *V(».',*) (10) 



a'62l a 



is the reward rate gained by the agent a from taking the 
action x and measured relative to its value averaged over 
all the possible actions X. 

In what follows we will confine our consideration to the 
dynamics of the quantities q a (x,t) instead of Q a (x,t) for 
two reasons. One of them is the fact that the probabilities 
P a (x, t) of the agent choice can equally be treated as direct 
functions of q a {x,t). Indeed, substituting © into ([1]) we 
get 



P a {x,t) 



E 

Lx'ex 



(11) 



The other is the possibility to eliminate a strong homoge- 
neous growth of the action value from consideration which 
does not affect the system dynamics at all and, so, has 
no definite physical interpretation. The quantities q a {x,t) 
meet the equality 



E 9a(x,t) =0. 



(12) 



x£X 



for any agent a at each moment of time t as follows form 
definition ©. 

It should be noted that expression (fT0|) actually spec- 
ifies some autonomous operator r a {gi, q%, . . . , qjy} map- 
ping the quantities {qi} onto the reward rate 

r a (x,t) = r a {q 1 (x,t),q 2 (x,t), . . .,q N (x,t)} , (13) 

which holds also for a rather arbitrary form of the W(P)- 
dependence (Appendix [A)) . So, in fact, expression © is 
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an autonomous nonlinear equation. It forms the complete 
continuous-time description of the multiagent system at 
hand provided the payoff function R a (x\X a ) is known and 
the agent memory is characterized by the time scale T . 
To clarify the latter statement let us consider the integral 
representation of equation (JSJ). 



2.2 Memory models and the notion of initial conditions 

Using the method of variation of parameters the differ- 
ential equation ([5]) is reduced to the following Volterra 
integral equation 

t 

f , _ (*-«') , _ (t-t ) 

q a {x,t)= dt'e t r a {x,t')+e T q a (x,t ) , (14) 



where the function r a (x,t) is given by expression (I13[) . 
The Volterra equation (|T4")) can be interpreted as the ex- 
plicit formulation of the agent memory model character- 
ized by the time scale T. The former term on the right- 
hand side of (fT4")l specifies the accumulation of the agent 
knowledge about the action value during the time interval 
(to,t), whereas the latter one determines the evolution of 
the knowledge gained in the past. In fact, dealing with the 
whole history of the system we replace expression (ITU) by 
the corresponding integral over the semiaxis (— oo,t) 



<la{x,t) 



dt', 



(t-f) 
t r a (x,t'). 



(15) 



Then using to the property of the exponential function 



_ («-*') _ (t-t ) (tp-Q 
e t = e t ■ e T 



(16) 



we introduce the notion of initial conditions just setting 



(to-t') 

q a (x,t ) = dt'e T r a (x,t') , (17) 



converting (fT5]) back to (|14p . In the frameworks of this 
model all similar events within time span about T con- 
tribute to the agent perception equivalently and the func- 
tion K(t - t') = exp{-(i - t')/T} is the kernel of the 
integral operator (114[) weighting the current contributions 
of the events happened in the past. 

If the agent memory is described by another kernel 
K(t — t') not equal to the exponential function, i.e. 



q a {x,t) = / dt'K(t-t')r a (x,t') 



(18) 



then the property corresponding to equality (fTBj) does not 
hold and the notion of initial conditions becomes inappli- 
cable in the general case. 



However appealing, for example, to ecological systems 
discussed in Introduction we see that the notion of ini- 
tial conditions can have its own meaning independent of 
specific memory models. Habits of offspring being of mul- 
tifactorial nature originate, in particular, from a mixture 
of parental and environmental contributions. The parental 
effects include a non-genetic parent-offspring transmission 
of preferable strategies of behavior reflecting the aware- 
ness gained previously by parents. The environmental fac- 
tor is responsible for accumulating information of personal 
experience and its influence grows with increasing age. A 
discussion of these phenomena in connection with the bio- 
diversity formation can be found, e.g., in reviews [61,62, 
163j . Therefore in mathematical description of the cumu- 
lative effect of natal awareness and personal experience 
on species adaptation the time of birth is the appropriate 
point for introducing the relevant initial conditions. 

In the present paper we will construct initial condi- 
tions for the analyzed process of reinforcement learning 
presuming that there is a certain specific time moment 
to when this process is initiated. Keeping in mind the 
parent-offspring communication let us adopt the follow- 
ing three assumptions about the learning process of agents 
with scale-free memory. 

First, within a sufficiently long time interval T the 
agents remember the times {t'} at which events happened 
and their contribution at the current moment of time t is 
weighted by the kernel Kit - t') cx l/(t - t')^'^ with 
the exponent < 7 < 1. The latter inequality is due the 
fact that, on one hand, the agent preference should be a 
really cumulative effect of all the previous rewards, i.e. the 
integral 



dt' 



C- := [ dt'Kit -t')oc [ 



-T~< (19) 



has to diverge as formally T — > 00. On the other hand, 
the kernel K(t — t') must be a decreasing function. The 
estimate C_ of integral (|19l) can be regarded as a cer- 
tain capacity of agent memory relating the action value 
q a (x) ~ r a (x)C- to the mean rewards f a (x) gained by 
the agent a during this time. 

Second, on temporal scales larger than T the agents do 
not rank the events according to their occurrence times, 
they just fix these events in the memory. It is described 
by the replacement 



t-T 



t-T 



dt' K(t-t')r a (x,t')=>f a (x) / dt'K{t-t'). (20) 



So on such scales the integral 

rt-T 



C+~ / dt'K(t-t') 

J —OO 



(21) 



is to converge at the lower limit. In addition, its contribu- 
tion to the memory capacity should be of the same order, 
i.e. the estimate C+ ~ C_ must hold. The latter is the case 
if the kernel K(t - t') cx T 2 f/(t - i') 1+7 for (t - t') > T. 
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Here the factor T 27 is due to the continuity of the function 
K(t - t') at t - t' = T. 

Summarizing the two assumptions we state that the 
kernel K(t — t') of the scale-free memory model should 
exhibit the following asymptotic behavior 



K(t-t') 



^1-7 



{t-t'y-f 



for t-t' <T. 



(22a) 



and 



K(t-t') 



-7 



for t-t' >T. 



(22b) 



Here a certain "microscopic" time scale r has been intro- 
duced because the kernel K(t—t') must be a dimensionless 
quantity in the present constructions. Let us make use of 
the so-called 7-exponential function |64j or, what is the 
same, Rabotnov's function 165 



K{t - t') 



t-t' 
T 



(23) 



to construct the crossover between the given asymptotics. 
Here 

00 ^ 

^7,7(3) = E ( 24 ) 



fc=0 



rp + i) 7 ] ' 



is the Mittag-Leffler functions in two parameters and r{z) 
is the gamma function. In the limit 7 — ¥ 1, when all the 
events within the time scale T contribute equivalently to 
the agent perception at the current time, kernel (f2"3"f takes 
the exponential form, K(t — t') — s- exp[— (t — t')/T\. Fig- 
ure [1] illustrates the behavior of kernel ((23]) . 

Third, at the initial time tp the agents have no personal 
knowledge about the value of the actions X and can rely 
only on awareness gained previously in some way. It is a 
certain analogy to the situation described within the sec- 
ond assumption; only the fact that some event happened 
is essential, whereas its occurrence time is not known. 
So to quantify such pre-awareness we can deal with only 
some quantities q a (x,to) aggregating this information as 
a whole. To make it possible to measure the contributions 
from the pre-awareness and the personal experience in the 
same units let us introduce an effective reward rate f a (x) 
related to the quantity q a (x,to) by the expression 



to 



q a {x,t Q ) = r a (x) I dt'K(t -t') 



(25) 
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Fig. 1. A plot illustrating the excepted approximation (|23[) of 
the kernel K(t — t') vs the time difference (t — t') (upper panel) 
and the crossover between asymptotics (|22|l (lower panel). 



where the function 



K b (t - t ) 



to 



J dt'K(t-t') 



f dt'K(t ~t r ) 



(27) 



Using the Mittag-Leffler function in one parameter E 1 (z) 
defined via the series EH 



E 1 {z)=Y J 



00 u 
Z K 



k=0 



r(jk + i) 



(28) 



we can show directly that 



d 

dt 7 



t^ 

T 



thus, 



K(t-t') = (r 1 -^)—E 



t-t' 
T 



Then as time goes on the contribution of the pre-awareness Therefore formula (TJ7]) can be rewritten as 
evolves as 



to 



K b (t - t ) = E 1 



f a {x) / dt'K(t-t')=K b (t-to)q a (x,to) 



(26) 



t-t 
T 



(29) 



provided the kernel K (t — t') is given by expression (1231) . 
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Combing together the three assumptions we write the 
desired integral Volterra equation governing the multia- 
gent reinforcement learning with scale-free memory in the 
following form 



2a (x,t) = / dt' 



E~, 



to 



En 



(t-t'y-y 



t-t 
T 



r a (x,t') 



q a (x,t ). (30) 



As before, the former term in this equation specifies the 
accumulation of the agent knowledge about the action 
value gained via reinforcement learning, whereas the lat- 
ter one describes the evolution of the contribution from 
the pre-awareness. It should be noted that relative math- 
ematical constructions were discussed within the so-called 
temporal-difference algorithm of the reinforcement learn- 
ing [57]. 

Concluding the given section we underline once more 
that the introduced notion of initial conditions implies the 
existence of a certain special point in the agent "life" ; it 
is the moment when the agents start their activity for the 
first time and, thus, have no direct experience in taking 
these specific actions. In contrast, the memory model with 
a fixed time scale enables one to impose initial conditions 
on the system dynamics at any moment of time. 

2.3 Governing equation 

The obtained integral equation (|30l) can be converted into 
a differential equation with fractional time derivative us- 
ing fractional calculus. Namely, the known relationship 
between the Cauchy type problems for fractional differ- 
ential equations and the Volterra integral equations 64 
enables us to reduce ([50]) to the following equation 

c Dlq a (x,t) = T^rafat) -J-jq a (x,t) , (31) 

where the left-hand side is the Caputo fractional derivative 
of order 7 defined by the expression 



Cn7 



1 



dt' dq a (x,t') 



r(i- 7 )j (t-t'p 



dt' 



(32) 



Equation ([31]) should be subjected to the initial condi- 
tion @ or, more strictly, to the condition 



q a (x,t ) = q*(x) 



(33) 



with the quantities q^(x) = Q„(x) — {Qal x )) x gi ven be- 
forehand. Expression (1311) is the desired governing equa- 
tion of the mean field theory for the analyzed multiagent 
reinforcement learning with scale-free memory. 

We also point out that for the scale-free memory the 
description of the multiagent reinforcement learning is no 
longer reduced to the replicator equations of population 
biology. For this reduction to hold the governing equation 
of the reinforcement learning has to be of the first order 
in the time derivative. 



2.4 Pairwise agent interaction 

To complete the construction of the model at hand we 
need to specify the interaction of the agents which is de- 
termined by the payoff function R a {x\X a ). Let us confine 
our consideration to the pairwise approximation of the 
agent interaction [66; . In particular, biodiversity effects 
can be largely described in terms of pairwise interactions 
among species [6T1IB5] . however, multi-species interactions 
seem to be necessary in describing complex hierarchical 
communities [69U70] . In the model of pairwise agent inter- 
action the payoff function R a (x\X a ) is written as (cf., e.g., 

HQ) 

R a (x\X a )=p x a + J2 Pli- (34) 

a'62t„ 

Keeping in mind formula (|10[) determining the reward rate 
r a (x,t) as well as the identity 



X^aOM) - 1 



it is worthwhile to rewrite expression (|34[) in such a way 
that eliminates the terms not contributing to r a (x,t) and 
combines similar terms with one another. Namely, let us 
make use of the following replacements 



P X a--=P X a-(p V a) +E 



raa' / . \ raa' 

y \ ' yy 



and 



p xx , 

raa 



p xx , 



xy 
Paa> 



yx 
Pan' 



P VV , 
raa' 



where the notations 

(■••>,-s»-) ™ d (-L-iP £<-> 

y&X y,y'&X 
have been introduced. In this case expression ([TO]) becomes 

r a {x,t) = pl+ Yl Y,P X a X >Ax,t) (35) 

and without loss of generality in what follows we will pre- 
sume the equalities 

[Pi) =0, Ut) =0, Ui) ,=0 (36) 
y ^ 1 y > 1 y 

to hold beforehand. 



3 Rock- Paper-Scissors model 

In the present section we consider a simple system of the 
rock-paper-scissors type and demonstrate that the scale- 
free memory can give rise to a complex dynamics caused 
by instability modes with different time scales. The spe- 
cific model to be constructed mimics, in particular, some 
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Fig. 2. Diagram illustrating the agent interaction of the rock- 
paper-scissors type. 



characteristic features of mating behavior in trimorphic 
populations of the side-blotched lizard (see [75] and ref- 
erences therein); a general review on the role of sexual 
selection in maintaining biodiversity can be found in |73j . 
In these populations female preferences impact on the fre- 
quency dependent selection and, what is essential, change 
with the female age. Females of the side-blotched lizard 
exist in two distinct morphs, however, in describing the 
female choice based on multiple male traits three types 
of females can be singled out [TJ]. Males express three 
morphs that exhibit alternative strategies in intrasexual 
competition referred to as the rock-paper-scissors biolog- 
ical game (see Introduction). Recently, the mate choice 
governed by learning and accumulation of individual ex- 
perience has attracted theoretical attention, in particular, 
when of the optimal choice is disturbed in a varying en- 
vironment |75] or as a self-organization phenomenon in 
mutual male- female learning [76] ; a review of the preced- 
ing models and used notions is also presented in the cited 
papers. In this context the model under consideration can 
be regarded as an example illustrating the dynamics of fe- 
male mate choice affected by the female intrasexual com- 
petition. 



3.1 Model 

Let us focus on two systems comprising two agents {ai, 02} 
or three agents {01,02,03}, and the set of their possible 
actions consisting of three elements {x±, X2, X3}. All these 
actions on their own are considered to be equivalent for ev- 
ery agent, therefore, we set p x a = for all a and x. The im- 
plicit pairwise interaction of the agents with one another 
is determined by two factors. One of them is the struc- 
ture of rewards gained by a pair of agents taking different 
actions. These rewards are assumed to be determined by 
the rock-paper-scissors payoff matrix 



1 -1 
-1 1 

1 -1 



which is illustrated in Fig. [5] According to this payoff ma- 
trix if, for example, the agent ai takes the action x\ and 
the agent 02 takes the action X2, then the first agent re- 
ceives the benefit #12, whereas the second one loses this 
value. The other factor is the reward redistribution when, 



for example, the agents ai and 02 take the same action X{. 
To specify the latter factor we ascribe to each agent a its 
individual "power" < r) a < 1 and assign, for example, 
to the agents ai and 02 the rewards 



e 12 — 312^12 



m + m 



e 21 — <?2lW21 



172 



m + m 



(37) 



respectively, with the interaction constants gn' and un' > 
being symmetrical with respect to permutation of in- 
dices. 

Summarizing these assumptions and following the ac- 
cepted renormalization of the payoff function R a {x\X a ) 
(Sec. l2.4[ ) we write the interaction matrix p^, in the form 



Paa' — 9aa' 



1 



-1 
1 - 



3 fcaa 



-1 - 



1 



3 taa 



(38) 



In addition the time scale T characterizing the ability of 
the agent memory and introduced in Sec. l2.2l is set equal to 
infinity in order to study the effects of scale-free-memory 
on their own. 

Under such conditions the governing equation (1311) be- 
comes 



T^ 1C Dlq a (x,t) 



E 

a'62l a 



n xx' J3q a , (x ,i) 
raa' c 



09a' (*".*) 



This equation possesses the stationary solution 

CM = 



(39) 



(40) 



for every agent a and action x, which matches the Nash 
equilibrium attained when all the actions are equivalent 
in value and, thus, -P„ q (a;) = 1/3. 

In what follows equation (|39l) will be analyzed with re- 
spect to the stability of the system dynamics and develop- 
ment of a possible instability will be studied numerically. 
In addition, to single out the affects that are due to the 
scale-free memory solely the further analysis will be con- 
fined to the case of identical agents setting g aa i = g and 
£aa' = £a'a = ( for any pair of agents a and a'. 



3.2 Linear stability analysis 

In the given multiagent system there are two sources of 
equilibrium perturbations. One of them is the deviation of 
the initial values g*(x) from q° q (x), the other is random 
variations in the reward rates reflecting the probabilistic 
nature of the agent choice. Since the initial conditions can 
be imposed on this system only at a certain specific mo- 
ment of time the two types of perturbations need to be 
considered separately. 

The present paper concerns the deterministic descrip- 
tion of the agent choice dynamics and the equilibrium per- 
turbations are introduced via the initial condition (1331) . So 
in this section we explicitly analyze the system stability 
with respect to the first type perturbations, the second 
type perturbations are considered in Appendix [Bj 
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The eigenfunctions of the Caputo fractional derivative 
operator (|32[) meeting the Cauchy initial condition of type 
(133")) can be written in terms of the Mittag-Lcffler function 
in one parameter, namely, E 7 [X(t — to) 7 ], where A is the 
corresponding eigenvalue being a complex number in the 
general case. This follows from the identity [53] 



C D 7 £ 7 [A(t - t ) 7 ] = A£ 7 [X(t - t y 



(41) 



The known asymptotic behavior of the Mittag-Leffler func- 
tion E 1 (z) of order < 7 < 1 [53] enables us to represent 
the asymptotics of these eigenfunctions for t — > 00 as 



E 1 (At 1 



= l e (xv>t) 
1 



oil 

in 



(42a) 



when the argument of the eigenvalue A lies in the interval 
|arg(A)| < 771-/2 and 



E 1 (Xf 



xr(i -i)-m 



o 



1 



(42b) 



for 771-/2 < |arg(A)| < 7r. According to expressions (|42l) 
the instability occurs when the governing equation (|39l) 
admits the eigenvalues meeting the inequality |arg(A)| < 
771-/2. In this case, by virtue of (|42a[) . the perturbation 
growth is exponential. The asymptotics (|42bl) matching 
the stable system dynamics describes the power-law decay 
of perturbations. 

Therefore linearizing equation (1391) near the stationary 
point (|40p we seek its solution in the form 



q a {x,t) = 6lE 1 [X{t-t y] 



(43) 



where {0%} are some constants. In this way the eigenvalue 
problem for equation (|39l) is reduced to finding the eigen- 
values h of the matrices 



T-2 



Op 



■Fa = 



Opp 
pOp 
ppO 



(44) 



for the systems with two and three agents, respectively, 
where the notation 



P = 



|e 1- \e -1- \e 
-l-\e |e l-±e 
1-4 -l-±e 2 -! 



(45) 



stands for matrices (|38[) in the case under consideration. 
The eigenvalues h and A are related to each other via the 
expression 

A = i hgr 1 ^ . (46) 

In addition, by virtue of (|12j) the corresponding eigenvec- 
tors are to meet the equality 



(47) 
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0.8 
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H 
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Fig. 3. Instability diagram for the analyzed systems with iden- 
tical agents and T — ► 00. 



for every agent a. Using Wolfram Mathematica 7 we have 
found the desired collection of the eigenvalues 

t-e±iV3,e±iV3,\ (48a) 

for the two agent system and 

|-e±i\/3,2(e±iv / 3)} (48b) 

for the three agent system, with the first two eigenvalues 
being doubly degenerate. 

Whence it follows that both the two systems become 
unstable when arg(e + iVo) < r yn/2, i.e. the inequality 



2 /VT 

7 > — arctan — 



(49) 



holds. Figure [3] depicts this condition. 

As demonstrated in Appendix[B]the found condition of 
the system instability holds also in the case when the equi- 
librium perturbations are caused by random fluctuations 
in the reward rates. So the complex system behavior found 
in the given deterministic model with dissipation cannot 
change drastically under the influence of small perturba- 
tions in the reward rates. 



3.3 Numerical simulation and the results 

3.3.1 Algorithm 

Under the assumptions adopted in the previous section 
the governing equation (|39p can be rewritten in the di- 
mensionless form using the replacements 



t — > r„t . 



f3q a (x,t) -t q a {x,t) , 



where the characteristic time scale t p of the system dy- 
namics is 

T ' P = Tfik^ ■ (50) 
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In these terms it takes the form 



Dtn 12,3 — -p2,3 P2 : 3 



(51) 



where, for example, for the two agent system the vectors 
q2 and P2 denote the following collections of variables 

Q2 = {qi(x 1 ),qi(x 2 ),qi{x 3 ),q 2 (x 1 ), q 2 (x 2 ), 92(^3)} 
P 2 = {Pi (xi) , Pi (x 2 ) , i\ (x 3 ) , P2 (xi) , P 2 (x 2 ) , P 2 (x 3 )} 

and the components of these vectors are related as 

-1 



P a (x) = 



e q " ix ' 



,x'=l 



(52) 



by virtue of (|TT|) . 

The right-hand side of equation (l5Tj) is a function of the 
vector q2,3 whose derivatives with respect to the compo- 
nents of this vector are bounded from above for any value 
of q2.3- It enabled us to make use of explicit algorithms 
in numerical simulation of the system dynamics (for dis- 
cussion of this point see, e.g., [65,77,78,79 ). Namely, the 
governing equation (1511) was solved numerically using the 
explicit 2-FLMM algorithm of second order in A [79] based 
on the following discretization of equation (f5~T|) 



n-i 

4 7) q„„ fc 

k=l 



qo 



+ At? [(2 - 1) P(q„_ 1 ) + (| - l) P(qn-2)] • (53) 

Here the indices denote the time moments t n — nA (n = 
2,3,...) at which the corresponding quantities are taken, 
whereas the indices 2 or 3 labeling the systems under con- 
sideration are omitted for the sake of simplicity, {^^} are 
the coefficients entering the Griinwald-Letnikov definition 
of fractional derivatives specified, for example, via the fol- 
lowing recursive formula 



,(7) 



.(7) A ! + 7 



,(7) 
J k-i 



for k = 1,2,. 



and the coefficient 



n-1 

k=a 



(7) 

k 



(54) 



(55) 



The value qi at the first step of the iteration was calcu- 
lated as 

qi =q a + A^P(q a ) (56) 

and the initial value qo meeting equality (|12[) was set ran- 
domly to initiate system perturbations near the Nash equi- 
librium (1401). 



3.3.2 Instability modes 

In the given paper we confine ourselves to discussing var- 
ious modes of the system instability found numerically. 



Let us, first, present the results of simulation for the two 
agent system. Figure |4] depicts two modes A and B of the 
long-term dynamics gotten by varying the initial condi- 
tions. The shown curves were obtained for the parameter 
e = 0.25 and the memory exponent 7 = 0.91. On the sta- 
bility diagram (Fig. [3]) this point lies inside the instability 
region just near its boundary; for the given magnitude of 
the parameter e the critical value of the memory exponent 
is 7c fa 0.9087. 

The mode A is related to a stable limit cycle in the 
phase space Pi = {Pi(xi), Pi(x 2 ), Pi (£3)} arising when a 
mismatch between the actions of the two agents is remark- 
able (Fig. HI lower row). The mode A was found can arise 
in the stability region also, i.e., when 7 < 7c , in particular, 
for 7 = 0.905. Figure [3] characterizes the system stabil- 
ity only with respect to infinitesimal perturbations rather 
then perturbations with finite initial amplitudes. So, these 
results demonstrate us that the mode A of the system in- 
stability undergoes the subcritical bifurcation as the mem- 
ory exponent 7 increases. The periodic oscillations found 
in the subcritical region, 7 < j c , are rather similar in form 
to those shown in Fig. 21 Only when the memory exponent 
7 goes away from the critical value 7c (e) and comes close 
to a certain boundary of the absolute stability 7s (e) these 
oscillations exhibit more complex behavior. In particular, 
Fig. [5] (left column) depicts steady state oscillations ob- 
tained for 7 = 0.905. The corresponding trajectory of the 
system motion fills uniformly a certain neighborhood of 
the previous limit cycle. So such system motion and can 
be regarded as oscillations of the mode A shown in Fig. 2] 
whose amplitude undergoes some time variations. We have 
failed to find steady state oscillations for 7 = 0.904 and 
e = 0.25; all the perturbations induced by initial random 
conditions faded out. It allows us to estimate the bound- 
ary of the absolute instability as 0.904 < 7s (e) < 0.905 for 
the given value of the parameter e. So, when the mode A 
of the system instability arises as the memory exponent 
7 increases, its attractor seems to become rather complex 
in form instantly without a smooth transformation of a 
quasicircular line in the phase space Pi. 

The second mode B is related to the appearance of 
oscillatory instability undergoing the supercritical bifur- 
cation, i.e. arising only in the instability region 7 > 7c . 
It turns out that even in the close proximity to the in- 
stability boundary the simplified model (j5Tj) seems not to 
be able to describe a possible steady state dynamics of 
the type B instability. The found time pattern Pi [x\ , i) 
exhibits the agent preference of taking only one action to 
become stronger and stronger as time goes on; the same 
does the duration of this choice (Fig. right column). 
As also seen in this figure the mode B matches the syn- 
chronized behavior of the two agents. It is likely that such 
oscillations can be stabilized on temporal scale of order 
T by the capacity of the agent memory. In any case this 
feature is worthy of individual analysis. Now we can claim 
that, at least, the characteristic time scales of oscillations 
caused by the instability onset within the modes A and B 
differ from each other dramatically. 
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mode A with subcritical bifurcation mode B with supercritical bifurcation 




P 1 { Xl ) Py{x{) 

Fig. 4. Two modes of the long-term dynamics found for the two agent system. The upper row visualizes the dynamics of the 
agent a\ as a ternary phase portrait of its trajectory {Pi (x\,i), Pi(x2, t),Pi(x3,t)} and the middle row shows the corresponding 
time pattern Pi(xi,t). The agent ai and the action xi were chosen to exemplify the typical characteristics of the system 
dynamics. The lower row visualizes the correlations in actions of the two agents ai and ai in terms of the relationship between 
the probability of their choice of the action X\. The present data were obtained for e = 0.25 and 7 = 0.91. The critical value of 
the memory exponent is equal to 7 C ~ 0.9087 for the given magnitude of the parameter e. 



As the memory exponent 7 goes away from the insta- 
bility boundary j c inward the instability region the mode 
A becomes dominant, whereas the mode B looses its sta- 
bility. It is demonstrated in Fig. [5] (right column) visual- 
izing an example of the transient processes of the instabil- 
ity development that at the initial stage can be classified 
as the mode B and at the final stage converts into the 
mode A. Besides, the shown pattern being rather com- 
plex in structure enables us to presume that there can be 
other modes of the system instability which, at least, are 
met ast able. 

The found two modes A and B can be interpreted as 
a self-organization phenomenon of the "selfish" or "altru- 



istic" behavior of agents; the term "self-organization" is 
used because in the given model the agents act indepen- 
dently of one another. Indeed, within the mode A each of 
the two agents alternately plays the role of defector taking 
mostly the action "beating" the action chosen by the other 
agent. It is clearly demonstrated Fig. H] (lower row panels). 
Oppositely, within the mode B the agents cooperate with 
each other taking mainly the same action. This behavior is 
really altruistic because the reward gained by the defector 
exceeds the reward gotten in sharing the same action by 
the factor 2/e for the used system parameters. The "al- 
truism self-organization" seems to be due to the scale-free 
memory because we have failed to find the mode B in 
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time t. in units of t v time t, in units of t p 

Fig. 5. The ternary phase portrait of a trajectory {Pi (xi, i), Pi(x2, t), Pi(x3, t)} and the corresponding time pattern Pi (xi, t) of 
the probability oscillations within the mode A in the subcritical and supercritical regions of the instability onset. The presented 
data were obtained for the two agent system with the parameter e = 0.25. 




Fig. 6. The ternary phase portrait (in 6000 dots individually) of a trajectory {Pi(xi,t), Pi(x2,t), Pi (a?3, t)} and the correspond- 
ing time pattern Pi(xi,t) visualizing typical features of the instability development for the tree agent system. The presented 
data were obtained for e = 0.25 and two values 7 = 0.91 and 7 = 0.93 of the memory exponent. 
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simulating actually the same system where, however, the 
agent memory is characterized by a single temporal scale 
(Sec. 12. 2[ ). Up to now possible mechanisms responsible for 
the altruistic behavior are far from being understood well; 
a review of the corresponding phenomena and the recently 
proposed models can be found in [8THI8T] . At least, altruism 
does not readily evolve as illustrated by evolution of coop- 
eration in the prisoner's dilemma game [82 . In the given 
model the altruistic behavior matches the agent choice of 
one action for longer and longer time during the learning 
process. We note that similar dynamics was found in mod- 
eling cyclic variations in cooperator-cheat systems, where 
altruism and conspicuous tags of altruists are inherited 
via different mechanisms [53]. 

Figure [6] depicts typical details of the instability devel- 
opment for the three agent system. In this case the inter- 
action between the agents destroys the mode A and the 
system dynamics becomes irregular. However, as seen in 
this figure, near the instability boundary 7 C w 0.9087 for 
e = 0.25 the mode B, nevertheless, can survive and emerge 
after a sufficiently long transient process during which the 
instability development is repeated several times. As a re- 
sult the corresponding phase portrait in the form of dots 
exhibits some attraction of the system dynamics towards 
the origin visualized as a certain origin neighborhood of 
dot accumulation. Outside this narrow neighborhood of 
the instability boundary the system dynamics becomes 
more irregular. The corresponding phase portrait shown 
for 7 = 0.93 matches a rather uniform dot distribution 
over the phase space. The related time pattern, never- 
theless, again demonstrates the fact that the system from 
time to time returns to the origin and remains in its vicin- 
ity during a time interval determined by the instability in- 
crement. These results for the three agent system enable 
us to claim that the proposed multiagent model of the 
reinforcement learning with scale-free memory describes 
some anomalous mechanism forcing the system to return 
periodically to the origin and reside in its neighborhood 
during a remarkable time in spite of the origin being the 
unstable point. This mechanism could be a possible ex- 
planation for the observed evolutionary stable cyclic varia- 
tions in trimorphic populations altered regularly by break- 
downs and the formation of dimorphic or monomorphic 
populations (see Introduction). 



4 Conclusion 

A model for multiagent reinforcement learning with scale- 
free memory has been constructed. Its development was 
partly stimulated by our attempt to elucidate a way of 
describing memory effects in systems where living beings 
play a crucial role and which can be responsible for long- 
term phenomena observed in stock markets, scale-free for- 
aging etc. The model details, including the agent non- 
transitive interaction of the rock-paper-scissors type, re- 
flect characteristic features of trimorphic populations at- 
tracted much attention during the last decade in the con- 
text of speciation. 



The reinforcement learning model assumes that the 
agents accumulate their rewards gained from taking some 
actions to get awareness about their value and, in addi- 
tion, act independently of one another. The interaction 
arises implicitly via the reward of one agent depending 
on actions of the other agents. The probability of tak- 
ing an action is related to the gained awareness via an 
exponential function (Boltzmann model). The reward ac- 
cumulation is described by an integral operator with a 
power-law kernel. In mean field approximation the final 
governing equation with fractional time derivative is con- 
structed. The scale-free memory poses a question on the 
notion of initial conditions for such systems and a certain 
approximation allowing it has been found. Its key point is 
to relate the initial conditions with the time moment when 
the agents start their activity and, thereby, can have only 
an awareness inherited in some way. In this case the gov- 
erning fractional differential equation is shown to be of the 
Caputo type. 

The dynamics of systems comprising two and three 
identical agents with the rock-paper-scissors interaction is 
analyzed in detail. First, the system stability is studied an- 
alytically and, then, the instability development is inves- 
tigated numerically. In particular, it has been found that 
the longer memory, the more easily the Nash equilibrium 
looses stability, which is in agreement with the model of 
self-organization induced by dynamics of uncertainty |16) . 
Roughly speaking the agents with weak memory cannot 
recognize the presence of other agents and treat their in- 
fluence as random fluctuations in the environment. 

For the two agent system the numerical analysis of 
the instability has demonstrated the existence of two its 
modes significantly different in their properties. One mode 
matches a limit cycle in the system phase space, stable pe- 
riodic oscillations in the probability of agents choice, and 
a certain mismatch in the agent behavior. This mode un- 
dergoes the subcritical bifurcation and is dominant when 
the system parameters lie inside the instability region far 
enough from its boundary. The other mode undergoing 
the supercritical bifurcation describes oscillations in the 
agent preference whose amplitude and period grow contin- 
uously in time, at least, within the simplified model used 
in numerical simulation. These oscillations could be sta- 
bilized by the limit capacity of the agent memory, which 
however is worthy of individual investigations. The latter 
mode matches the synchronized behavior of the two agents 
and can be regarded as "altruism self-organization" . The 
studied transient processes enable us to assume that there 
should be other modes of the system instability which, 
however, seem to be metastable. These phenomena are 
due to scale-free memory, at least, for actually the same 
model where, however, the memory is described by a sin- 
gle scale, only the first instability mode has been found 
and it undergoes the supercritical bifurcation. 

In the three agent system the interaction of agents de- 
stroys the first mode and the system dynamics becomes 
irregular. However, near the instability threshold the sec- 
ond mode can survive and emerge after rather long and 
complex transient processes. For all the initial conditions 
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randomly generated in numerical simulation the dynamics 
of the three agent system exhibits anomalous attraction 
to the equilibrium point, namely, it periodically returns 
to the equilibrium point being unstable and resides in its 
vicinity during remarkable time intervals. This behavior 
resembles the observed dynamics of trimorphic popula- 
tions with evolutionary stable cycle variations alternated 
by breakdown events giving rise to dimorphic or monomor- 
phic populations. 

The obtained results enable us also to presume that to 
describe the observed complex phenomena in speciation a 
relevant model should deal with not only the proportions 
of species but also some variables quantifying the process 
of species individual learning in adapting to variations in 
the environment as well as the population structure. 

The authors appreciate the support of RFBR Grants 09-01- 
00736 as well as the research support R-24-4 from the Univer- 
sity of Aizu. 



A Mean field approximation 

Let us consider a generalized form of the update rule (JSJ 

Q a {x,t n+X ) = W[P a (x,t n )] -Sx^ARaixlXa) 

- j; Qa(x,t n ) + Q a (x,t n ) , (57) 

where the weight coefficient W [P a (x, t n )] is assumed, at 
first, to be a certain smooth function of the choice prob- 
ability P a {x,t n ). The system update at the step t n is de- 
termined by the collection of expressions (|57|) . where the 
agent index a and the action index x run over all their 
possible values independently of each other and the Kro- 
necker delta S XjXa reflects the choice of the action x a by 
the agent a at the time moment t n . 

According to the adopted assumption ([5]) the quanti- 
ties {Q a (x, t)} cannot change substantially on scales of or- 
der A. Therefore we can consider a composite step of the 
system update comprising k elementary steps such that 
l<fc< T/t and to write for it the following equation 



)a{x,tn +k ) = W[P a {x,tn)\A ^ 5 XXa R a {x\X a ) 

(58) 



x a ,x a e€ ak 



kA 

rp Qa \X: t n ) -f- Q a{x , tr, 



Here the symbol € a k stands for the set of actions 





at 


a 2 . 


■ a N 


1 


Xu 


X\2 ■ 


■ Xin 


2 


X21 


X22 ■ 


■ X 2 N 


k 


Xkl 


Xk2 ■ 


■ XkN 



(59) 



where xy is the action taken by the agent a,j at the el- 
ementary step t n +i of the system update. Each row in 



table (|59")) represents the set x a [J X a of actions at the cor- 
responding time from the standpoint of the agent a. 

The updated values Q a (x 1 t n+ k) are determined by a 
specific realization of the set € a k so, strictly speaking, they 
are random quantities. However when the time capacity 
T of the agent memory is high enough we can choose 
the number of elementary steps sufficiently large that any 
agent explore many options of its choice as well as the 
choice of the other agents. In this case the sum entering 
equation (|58[) may be replaced by the sum running over 
all the possible realizations of the set x a 1J X a with the 
weights specified by their probabilities 



p(x a {JX a \t) =P a (x a ,t) J[ P a '{x a ,,t). 



(60) 



It should be noted that it is exactly the point where the 
adopted assumption about the mutual independence of 
the agents in their actions has been taken into account. 
The replacement 



E 

,x a e€ a 



k P(x a \JX a \t 

x a ,X a 



(61) 



converts equation (I55|) into the following 



) = kA^W[P a ( 
x ^2R a (x\X a ) J| P a '(x a/ ,t n ) - —Q a (x,t n ) I 

*- a'eil a 



Q a (x,t n ) (62a) 



and treating the quantities {Q a (x, t n )} as continuous func- 
tions of time we get 



dQ a (x,t) 
dt 



W[P a {x,t)]P a {x,t) 



xJ2^a(x\X a ) Yl P a ,(x a >,t)- -Q,(l,f), (62b) 
X a a'eila 

which is the desired mean field approximation of the up- 
date rule ([ST]). 

Keeping in mind the reasoning presented in Sec. 12.1 
we convert from the quantities Q a (x,t) to the quantities 
q a (x,t) showing their relative variations, i.e., 



q a (x,t) = Q a {x,t) - — ^ Q a (x\t) . 



(63) 



x'ex 



In these terms equation (I62b[) reads 
dq a (x,t) 



dt 



r a (x,t) - —q a (x,t), 



(64) 
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where the relative reward rate is given by the expression 
r a (x,t) = W[P a (x,t)]P a {x,t) 



A',, 



■fi E W[P a (x,t)]P a (x,t) 



x'ex 



J2Ra(x'\X a ) Yl Pa'{x a >,t). (65) 



a'ea„ 



By virtue of relationship (jlip and expression (I55|) the rel- 
ative reward rate r a (x, t) is an autonomous function of the 
quantities {qi(x, t), qvfa, t), . . . , qN(x, t)}, thereby, the col- 
lection of equations (I64[) forms a closed autonomous model 
for the reinforcement learning under consideration. 

To complete the derivation of equation we need to 
specify the VF(P)-dependence. 



The TT(P)-ansatz 

Let us analyze the family of ansatze 



w(p) = p- 



(66) 



with a nonnegative exponent, v > 0, which allows for pos- 
sible overweighting of low-probability events. When the 
agents do not interfere with one another in their actions 
the payoff function R a (x\X) = R a (x) for any agent a does 
not depend on the choice of the other agents, thereby, the 
relative reward rate r a (x,t) in its list of arguments con- 
tains only the quantities q a {x). Let us consider in detail 
this situation when the set of possible actions consists of 
only two actions, X = {xi,X2}, which, in addition, are 
equivalent, i.e., R a (xi) = R a {x2) = R- Then using re- 
lationship (fTTj) the system of equations for a given 
agent a is reduced to the equation 



dq _ R sinh ((1 - v)Pq) q_ 
~db~~2 [cosh(^g)] 1 -" T 



(67) 



containing solely the variable q = [Q a {xi) — Q a (x2)]/2. 

As it must be, the value q = matching the Nash 
equilibrium is a stationary point of equation (167[) . However 
in the limit of large values of the parameter T the Nash 
equilibrium can lose the stability if v < 1. Namely it is 
the case when the inequality 



(1 - u)(3RT > 2 



(68) 



holds and equation (|67[) admits three stationary points, 
the point q = being now unstable and two new points 
±q s matching the stable dynamics of the given system in 
their vicinity. Figure [7] illustrates this situation for v = 0. 
This instability is an innate feature of the given algorithm; 
the learning processes does not lead to the hidden Nash 
equilibrium q — not due to some external perturbations 
induced, for example, by influence of the other agents but 
because of the fact that the given agent exploring more 




1 : tanh(/3q) 

2: (2q)/(RT) 
a : /3RT < 2 stable NE 
b : (3RT > 2 unstable NE 



value q (in units of 1//3) 

Fig. 7. Illustration of the instability mechanism for reinforce- 
ment learning governed by equation ((67]). 



often those options that seem to it more preferable just 
faster accumulates their rewords. 

To eliminate this mechanism of the Nash equilibrium 
instability from the analyzed model we adopt ansatz (|66p 
with v = 1 . In this case equation (|64j) with expression (1651) 
immediately leads us to the governing equation (|8]) and 
express (JTT 



B Equilibrium perturbations caused by 
random fluctuations in the reward rates and 
the system stability 

In order to analyze the impact of random fluctuations in 
the reward rate on the system dynamics we return to the 
integral equation ([50)1 and split the term r a (x,t) into two 
parts 

r a {x,t)=r r a {x,t) + r f a (x,t), (69) 

where the former one is, as previously, the autonomous 
function of the action values, r^^q a r(x, t)}, specified by 
expression (|10p and the latter summand represents small 
random fluctuations in the reward rate. Dealing with small 
perturbations in the system dynamics we may consider 
different components of the pattern r[(x, t) as well as the 
variations 5q a {x,t) in the action values induced by these 
components separately. Therefore, let us confine ourselves 
to the case when the random fluctuations r[(x,t) are lo- 
cated inside some internal interval ^1^2] (here t\ > to) 
and consider time scales such that t — to — to. Since 
no perturbations in q a {x, t) can occur before the time mo- 
ment t\ equation p0[) for 8q a {x,t) and t — to 3> t% — 
reads 



Sq a (x,t) ^t 1 ^ / dt' 



{t-t'y-i 



KM 



E. 



1,1 



(t-t y--> 



dt'r f (x,t'), (70) 



16 



Ihor Lubashevsky, Shigeru Kanemoto: Scale-free memory model for multiagent reinforcement learning 



where 5r^(x,t) is the regular component of the reward 
rate linearized with respect to small variations of q a (x,t) 
near the equilibrium point q° q (x). 

Again the known relationship between the Cauchy type 
problems for factional differential equations and the Volter- 
ra integral equations [M] enables us to state that the 
asymptotics of the perturbations 5q a {x,t) obeys the fol- 
lowing fractional differential equation 

Dl6q a (x, t) = r^Srlix, t) - -L_ S q a (x, t) , (71) 

by virtue of ([70)1 . Here the left-hand side is the Riemann- 
Liouville fractional derivative of order 7 defined by the 
expression 

t 

DlMx,t) := _L_^ J J^^t'). (72) 

to 

The solution 8q a (x,t) of equation (fTTj) meets a certain 
initial integral condition that, in the frameworks of our 
analysis, can be replaced by to the requirement 

]im(t-t ) 1 -iq a (x,t) = C, (73) 
t— >to 

where C is some constant [51] . 

The eigenfunction of the Riemann-Liouville fractional 
derivative that meets requirement (|73[) and possesses the 
eigenvalue A is the so-called 7-exponential function [64] 

A z ^ {t -t y--f E ^ [x{t ~ tor] (74) 

with the asymptotic behavior as t — > 00 

ei'^^.) +0 (^) (75a) 
for |arg(A)| < 77r/2 and 

< = - A 2 r ( -7)-^ + °(^) (?5b) 

for 77r/2 < |arg(A)| < n. 

The obtained results enable us to state the follow- 
ing. First, by virtue of (142]) and (1751 the instability onset 
caused by perturbations in the initial conditions or ran- 
dom fluctuations in the reward rate matches the eigen- 
values A of the corresponding fractional derivatives be- 
longing to the same region on the complex plane. Second, 
equations (|31l) and ([7T]) governing the dynamics of per- 
turbations induced by both of these mechanisms actually 
have the same form within the replacement D] Q D] a . 
Thereby the instability condition (|49l) found in Sec. [32] de- 
scribes also the instability onset induced by the random 
fluctuations in the reward rate. 
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